5420 Anomaly Detection, Fall 2020

Assignment 8: Supervised Machine Learning

Submitted by: Harsh Dhanuka, hd2457

Objectives

It is important in any data science project to define the objective as specifically as possible. Below, let's refine it from general to specific. This will direct our analysis.

  • Build a model for Loan Default Prediction
  • Build a model for Loan Default Prediction using the Random Forest function
  • Build a model for Loan Default Prediction using the Random Forest function in the H2O package
  • Build a model for Loan Default Prediction using the Random Forest function in the H2O package, manually performing under and over sampling

Please click Section 4 to go directly to the Random Forest Models

Table of Contents

Section 1: Initial Steps

Section 2: Data Cleaning and Preparation, Feature Engineering

Section 3: EDA of all variables and binning

Section 4: Random Forest Models

Random Forest Model:

Random Forest is a type of bagging model. The term Bagging comes from Bootstrap Aggregating: many models are built independently and their predictions are averaged. The random forest technique is the best-known example. It draws a bootstrap sample to build one decision tree, draws another sample to build another tree, and so on; all of this sampling and modeling is done independently and can run simultaneously. The final outcome is the average of the predicted values across all the trees.
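The bootstrap-aggregating idea described above can be sketched in a few lines. This is a minimal illustration on a synthetic regression dataset, not the assignment's data: each tree is fit on a bootstrap sample (rows drawn with replacement), and the final prediction averages the trees. A real random forest additionally samples a random feature subset at each split.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: y depends on the first two columns plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=300)

trees = []
for _ in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeRegressor(max_depth=4).fit(X[idx], y[idx]))

# The bagged prediction is the average of all the trees' predictions
bagged_pred = np.mean([t.predict(X) for t in trees], axis=0)
```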

H2O package

H2O is a fully open-source, distributed in-memory machine learning platform with linear scalability. H2O supports the most widely used statistical and machine learning algorithms, including gradient boosted machines, generalized linear models, deep learning, and more. H2O also has an industry-leading AutoML functionality that automatically runs through the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R and Python communities.

1. Initial Steps

In [1]:
# Import all packages
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")

from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import roc_curve, auc, roc_auc_score, accuracy_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import precision_recall_curve

import plotly
import plotly.express as px

from imblearn.datasets import make_imbalance
import pylab as pl
from collections import Counter

1.1. Load Data

In [2]:
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 2 EDA/XYZloan_default_selected_vars.csv')
df.head(2)
Out[2]:
Unnamed: 0 Unnamed: 0.1 id loan_default AP001 AP002 AP003 AP004 AP005 AP006 ... CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005 MB007
0 0 1 1 1 31 2 1 12 2017/7/6 10:21 ios ... 13.0 13.0 0.0 0.0 1449.0 1449.0 2249.0 2249.0 7.0 IPHONE7
1 1 2 2 0 27 1 1 12 2017/4/6 12:51 h5 ... -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 -99.0 NaN WEB

2 rows × 89 columns

1.2. Basic Summary Check

In [3]:
print("Number of rows and columns in the dataset:")
df.shape
Number of rows and columns in the dataset:
Out[3]:
(80000, 89)
In [4]:
# Check basic statistics
print("Basic statistics of the columns are as follows:")
df.describe()
Basic statistics of the columns are as follows:
Out[4]:
Unnamed: 0 Unnamed: 0.1 id loan_default AP001 AP002 AP003 AP004 AP007 AP008 ... CD160 CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005
count 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 80000.00000 80000.000000 ... 79619.000000 79619.000000 79619.000000 79619.000000 79619.000000 79619.000000 79619.00000 79619.000000 79619.000000 77207.000000
mean 39999.500000 40000.500000 40000.500000 0.193600 31.706913 1.321813 2.014925 11.235413 3.30130 3.117200 ... 6.911956 14.271694 11.773358 909.089313 810.786219 1732.693314 1539.33443 2513.226491 2229.606137 5.976272
std 23094.155105 23094.155105 23094.155105 0.395121 7.075070 0.467174 1.196806 2.212313 1.33655 1.306335 ... 28.007499 38.235012 33.270641 1379.553332 1245.044602 2441.503517 2172.71384 3404.975112 3005.615048 3.641814
min 0.000000 1.000000 1.000000 0.000000 20.000000 1.000000 1.000000 3.000000 1.00000 1.000000 ... -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 -99.000000 -99.00000 -99.000000 -99.000000 0.000000
25% 19999.750000 20000.750000 20000.750000 0.000000 27.000000 1.000000 1.000000 12.000000 2.00000 2.000000 ... 2.000000 5.000000 4.000000 84.000000 34.000000 309.000000 226.00000 539.000000 414.000000 3.000000
50% 39999.500000 40000.500000 40000.500000 0.000000 30.000000 1.000000 1.000000 12.000000 3.00000 3.000000 ... 7.000000 12.000000 10.000000 475.000000 397.000000 1023.000000 870.00000 1553.000000 1324.000000 5.000000
75% 59999.250000 60000.250000 60000.250000 0.000000 35.000000 2.000000 3.000000 12.000000 5.00000 4.000000 ... 14.000000 23.000000 20.000000 1209.000000 1080.000000 2287.000000 2030.00000 3296.000000 2936.000000 8.000000
max 79999.000000 80000.000000 80000.000000 1.000000 56.000000 2.000000 6.000000 12.000000 5.00000 5.000000 ... 1061.000000 2792.000000 1579.000000 48585.000000 29664.000000 88364.000000 54651.00000 125352.000000 87312.000000 47.000000

8 rows × 86 columns

1.3. Basic EDA and considerations

a. Basic EDA of one variable AP006

In [5]:
df['AP006'].hist()
df['AP006'].value_counts()
Out[5]:
h5         44246
ios        17159
android    17140
api         1455
Name: AP006, dtype: int64

b. EDA of the target variable loan_default

In [6]:
# Check the target variable column
print("The number of 0's and 1's are:")
print(df['loan_default'].value_counts())

df['loan_default'].hist()
The number of 0's and 1's are:
0    64512
1    15488
Name: loan_default, dtype: int64
Out[6]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fbc86db5510>

c. Check the column data types, and NA's

In [7]:
#df.info()

d. Feature considerations from eyeballing the data types

  1. The first 3 variables are ID columns and cannot be used for predictions: Unnamed: 0, Unnamed: 0.1 and id.

They need to be dropped.

  2. AP005 is a Date-Time column, which cannot be used directly for predictions. A raw Date-Time field has almost all unique values, so it behaves like an ID column: it almost becomes a unique identifier for each record, as if you employed the 'id' field in your decision trees.

I will derive year, month, day, weekday, etc. from this field. In some models we may use 'year' as a variable just to explain any special volatility in the past, but we will never use the raw DateTime field as a predictor.

  3. The following columns have 0 as their value in all entries, and hence need to be removed from any model predictions:

TD025, TD026, TD027, TD028, CR012.

  4. The following columns have the same value in all entries, and hence need to be removed from any model predictions:

TD029, TD044, TD048, TD051, TD054, TD055, TD061, TD062.

  5. Check for categorical fields from the data variable descriptions, and convert the relevant numeric fields to categorical:

AP002 Gender, AP003 Education Code, AP004 Loan Term, AP006 OS Type, AP007 Application City Level, AP008 Flag if City not Application City, AP009 Binary format, MB007 Mobile Brand/type
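Rather than relying only on eyeballing, the all-zero and constant columns flagged above can be confirmed programmatically. A minimal sketch on a toy frame (the column values here are illustrative stand-ins, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    'TD025': [0, 0, 0],     # all zeros
    'TD029': [5, 5, 5],     # constant, non-zero
    'AP001': [31, 27, 33],  # varies
})

# A column with a single unique value carries no predictive signal
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]
# → ['TD025', 'TD029']
```

Running the same comprehension over the full df reproduces the lists in points 3 and 4.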

2. Data Cleaning and Preparation, Feature Engineering

2.1. Convert the DateTime column AP005 to the relevant formats of Year, Month, Day

In [8]:
df['AP005'] =  pd.to_datetime(df['AP005'])
In [9]:
# Create 4 new columns
df['Loan_app_day_name'] = df['AP005'].dt.day_name()
df['Loan_app_month'] = df['AP005'].dt.month_name()
df['Loan_app_time'] = df['AP005'].dt.time
df['Loan_app_day'] = df['AP005'].dt.day
In [10]:
# Drop old column
df = df.drop(columns = ['AP005']) 
df.head(2)
Out[10]:
Unnamed: 0 Unnamed: 0.1 id loan_default AP001 AP002 AP003 AP004 AP006 AP007 ... CD169 CD170 CD172 CD173 MB005 MB007 Loan_app_day_name Loan_app_month Loan_app_time Loan_app_day
0 0 1 1 1 31 2 1 12 ios 3 ... 1449.0 1449.0 2249.0 2249.0 7.0 IPHONE7 Thursday July 10:21:00 6
1 1 2 2 0 27 1 1 12 h5 5 ... -99.0 -99.0 -99.0 -99.0 NaN WEB Thursday April 12:51:00 6

2 rows × 92 columns

2.2. Convert the misrepresented numerical categorical variables back to relevant category/object format

In [11]:
df["AP002"] = df["AP002"].astype('object')

df["AP003"] = df["AP003"].astype('object')
df["AP004"] = df["AP004"].astype('object')
df["AP006"] = df["AP006"].astype('object')
df["AP007"] = df["AP007"].astype('object')
df["AP008"] = df["AP008"].astype('object')
df["AP009"] = df["AP009"].astype('object')

df["CR015"] = df["CR015"].astype('object')

df["MB007"] = df["MB007"].astype('object')

df['Loan_app_day_name'] = df['Loan_app_day_name'].astype('object')
df['Loan_app_month'] = df['Loan_app_month'].astype('object')
df['Loan_app_time'] = df['Loan_app_time'].astype('object')
df['Loan_app_day'] = df['Loan_app_day'].astype('object')

2.3. Drop all blank value/id columns

In [12]:
df = df.drop(columns = ['Unnamed: 0', 'Unnamed: 0.1', 'id', 'TD025', 'TD026', 'TD027', 'TD028', 'CR012','TD029', 'TD044', 'TD048', 'TD051', 'TD054', 'TD055', 'TD061', 'TD062']) 
df.head(2)
Out[12]:
loan_default AP001 AP002 AP003 AP004 AP006 AP007 AP008 AP009 TD001 ... CD169 CD170 CD172 CD173 MB005 MB007 Loan_app_day_name Loan_app_month Loan_app_time Loan_app_day
0 1 31 2 1 12 ios 3 3 1 1 ... 1449.0 1449.0 2249.0 2249.0 7.0 IPHONE7 Thursday July 10:21:00 6
1 0 27 1 1 12 h5 5 4 0 2 ... -99.0 -99.0 -99.0 -99.0 NaN WEB Thursday April 12:51:00 6

2 rows × 76 columns

2.4. Convert all the negative or mis-read values such as -99, etc, to 'nan' for imputation

As per the variable descriptions, all of the following columns are counts, lengths, or days. Hence, negative values such as -999, -99, -98, -1, etc. are mis-read NA's and need to be converted back to 'nan' format.

In [13]:
features_nan = ['AP001',
                'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
       'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
       'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
       'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
       'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
       'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
       'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
       'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
In [14]:
# Define a function to convert negatives to nan
# (.loc avoids chained assignment, which may silently fail to write back)
def convert_to_nan(var):
    df.loc[df[var] < 0, var] = np.nan
In [15]:
for i in features_nan:
    convert_to_nan(i)
In [16]:
# Verify that the negatives are gone
print("The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.")
df[features_nan].describe()
The minimum now stands at 0 for most of the columns, verifying the mis-represented values are gone.
Out[16]:
AP001 TD001 TD002 TD005 TD006 TD009 TD010 TD013 TD014 TD015 ... CD160 CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005
count 80000.000000 80000.000000 80000.000000 80000.000000 80000.000000 80000.00000 80000.000000 80000.000000 80000.000000 80000.000000 ... 76312.000000 76312.000000 76312.000000 76312.000000 76312.000000 76312.000000 76312.000000 76312.000000 76312.000000 77207.000000
mean 31.706913 1.986962 0.706213 3.593037 1.345700 5.40600 2.020812 6.804737 2.603662 0.718775 ... 11.501677 19.180352 16.573750 952.775121 850.212037 1812.070212 1610.332071 2626.427993 2330.516878 5.976272
std 7.075070 1.807445 0.918347 2.799570 1.413362 4.02311 1.973988 5.128183 2.505840 0.882962 ... 17.641851 30.743372 24.496918 1392.729146 1256.936168 2463.242747 2191.780118 3433.330482 3029.857757 3.641814
min 20.000000 0.000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
25% 27.000000 1.000000 0.000000 2.000000 0.000000 3.00000 1.000000 3.000000 1.000000 0.000000 ... 3.000000 6.000000 5.000000 123.000000 76.000000 382.000000 294.000000 649.000000 512.000000 3.000000
50% 30.000000 2.000000 0.000000 3.000000 1.000000 4.00000 2.000000 6.000000 2.000000 0.000000 ... 7.000000 13.000000 11.000000 518.000000 437.000000 1098.000000 940.000000 1658.000000 1423.000000 5.000000
75% 35.000000 3.000000 1.000000 5.000000 2.000000 7.00000 3.000000 9.000000 4.000000 1.000000 ... 14.000000 24.000000 21.000000 1258.000000 1123.000000 2369.000000 2107.000000 3417.000000 3037.000000 8.000000
max 56.000000 20.000000 11.000000 24.000000 21.000000 46.00000 35.000000 52.000000 43.000000 7.000000 ... 1061.000000 2792.000000 1579.000000 48585.000000 29664.000000 88364.000000 54651.000000 125352.000000 87312.000000 47.000000

8 rows × 61 columns

2.5. Fill the NA's using Iterative Imputer

IterativeImputer is a multivariate imputer that estimates each feature from all the others: each feature with missing values is modeled as a function of the remaining features, in a round-robin fashion.

Instead of the traditional simple mean, median, or mode imputation, I will use this smarter iterative imputation from the sklearn package.

The documentation is here: https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html
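To see the round-robin modelling in action, here is a toy sketch (synthetic values, not the loan data) where column 'b' is roughly 2×'a'. The imputer estimates the missing b from a, landing near 6 rather than simply reusing the column median:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

toy = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0, 5.0],
                    'b': [2.1, 4.0, np.nan, 8.1, 9.9]})

# Same settings as below: median starting values, then iterative refinement
filled = IterativeImputer(initial_strategy='median').fit_transform(toy)
# filled[2, 1] is the imputed b, estimated from the b ~ a relationship
```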

In [17]:
from sklearn.experimental import enable_iterative_imputer  # noqa
from sklearn.impute import IterativeImputer

Make a subset of the original df with only the numeric/float64 variables which have the NA's

In [18]:
df_2 = df[features_nan]
In [19]:
# Verify

df_2.head(3)
Out[19]:
AP001 TD001 TD002 TD005 TD006 TD009 TD010 TD013 TD014 TD015 ... CD160 CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005
0 31 1 1 4 1 5 1 14 2 2 ... 8.0 13.0 13.0 0.0 0.0 1449.0 1449.0 2249.0 2249.0 7.0
1 27 2 0 3 1 3 1 3 2 0 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 33 2 1 4 1 5 1 9 1 2 ... 0.0 3.0 2.0 33.0 0.0 33.0 0.0 143.0 110.0 8.0

3 rows × 61 columns

'MEDIAN' imputation through iterative imputer

In [20]:
imp = IterativeImputer(missing_values=np.nan, sample_posterior=False, 
                                 max_iter=10, tol=0.001, 
                                 n_nearest_features=None, initial_strategy='median')
imp.fit(df_2)
Out[20]:
IterativeImputer(initial_strategy='median')
In [21]:
imputed_data_median = pd.DataFrame(data=imp.transform(df_2), 
                             columns=features_nan,
                             dtype='int')
In [22]:
imputed_data_median.head(3)
Out[22]:
AP001 TD001 TD002 TD005 TD006 TD009 TD010 TD013 TD014 TD015 ... CD160 CD162 CD164 CD166 CD167 CD169 CD170 CD172 CD173 MB005
0 31 1 1 4 1 5 1 14 2 2 ... 8 13 13 0 0 1449 1449 2249 2249 7
1 27 2 0 3 1 3 1 3 2 0 ... 7 13 10 531 465 1086 939 1651 1425 5
2 33 2 1 4 1 5 1 9 1 2 ... 0 3 2 33 0 33 0 143 110 8

3 rows × 61 columns

2.6. Convert the loan amount column CR009 to a category variable and bin appropriately

In [23]:
df['CR009'] = pd.cut(x=df['CR009'], bins=[-1, 100000, 200000, 300000, 400000, 500000, 600000, 700000, 800000, 900000, 1000000, 1500000])
df = df.astype({'CR009':'object'})
In [24]:
df.CR009.value_counts()
Out[24]:
(-1, 100000]          74142
(100000, 200000]       4125
(200000, 300000]        975
(300000, 400000]        353
(400000, 500000]        166
(500000, 600000]         95
(600000, 700000]         48
(700000, 800000]         32
(1000000, 1500000]       31
(800000, 900000]         19
(900000, 1000000]        14
Name: CR009, dtype: int64

3. EDA of variables and binning

3.1. Check the Correlation using correlation plot

I will check this for the variables which are not direct counts, lengths, or days.

The variables used here are the ones marked as Credit Center data.

In [25]:
corr = df[['loan_default', 'AP001', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010', 'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024']].corr()
f,ax = plt.subplots(figsize=(18,12))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
In [26]:
# Remove 1 feature from each pair whose correlation exceeds 0.7
# However, since H2O handles multicollinearity smartly, I will not remove these variables

corr_var_drop1 = ['TD005', 'TD022', 'TD006', 'TD009', 'TD013', 'TD023', 'TD010', 'TD014']

Check the correlations for the counts, lengths, and days columns

I will be using the other variables as they are all Call detail data.

In [27]:
filter_col = [col for col in df if col.startswith('CD')]
filter_col.append('loan_default')
corr = df[filter_col].corr()
f,ax = plt.subplots(figsize=(21,21))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f',ax=ax)
plt.show()
In [28]:
# Remove 1 feature from each pair whose correlation exceeds 0.7
# However, since H2O handles multicollinearity smartly, I will not remove these variables

corr_var_drop2 = ['CD173', 'CD172', 'CD170', 'CD169', 'CD167', 'CD166', 'CD164', 'CD162',
                 'CD137', 'CD136', 'CD135', 'CD133', 'CD132', 'CD131', 'CD117', 'CD118',
                 'CD120', 'CD121', 'CD123', 'CD114', 'CD113', 'CD108', 'CD107', 'CD106',
                 'CD101', 'CD072']

3.2. Create categorical variables using qcut, and assign function to bin all continuous variables

These are for the raw data, not the NA imputed one.

  • You may need to split a continuous variable into a categorical variable
  • Notice how the NoData category is added for "NA".
In [29]:
df_bin = df.copy(deep = True)
df_bin.head(2)
Out[29]:
loan_default AP001 AP002 AP003 AP004 AP006 AP007 AP008 AP009 TD001 ... CD169 CD170 CD172 CD173 MB005 MB007 Loan_app_day_name Loan_app_month Loan_app_time Loan_app_day
0 1 31 2 1 12 ios 3 3 1 1 ... 1449.0 1449.0 2249.0 2249.0 7.0 IPHONE7 Thursday July 10:21:00 6
1 0 27 1 1 12 h5 5 4 0 2 ... NaN NaN NaN NaN NaN WEB Thursday April 12:51:00 6

2 rows × 76 columns

In [30]:
# Write a function and loop through 
def binning(var):
    df_bin[var + '_bin'] = pd.qcut(df_bin[var],15,duplicates='drop').values.add_categories("NoData")
    df_bin[var + '_bin'] = df_bin[var + '_bin'].fillna("NoData").astype(str)
    df_bin[var + '_bin'].value_counts(dropna=False)
In [31]:
features = ['AP001', # 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
       # 'AP008', 'AP009',
       'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
       'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
       #'CR009', 'CR015', 
       'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
       'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
       'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
       'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
       'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
       'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005'
       # 'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
       # 'Loan_app_day'
           ]
In [32]:
for i in features:
    binning(i)
In [33]:
# View the bins of some variables 

print(df_bin['TD001_bin'].value_counts(dropna=False))
print(df_bin['TD022_bin'].value_counts(dropna=False))
(-0.001, 1.0]    33040
(1.0, 2.0]       22199
(2.0, 3.0]       12186
(3.0, 4.0]        6152
(5.0, 20.0]       3510
(4.0, 5.0]        2913
Name: TD001_bin, dtype: int64
(9.999, 15.0]    36564
NoData           19598
(15.0, 20.0]      9462
(25.0, 30.0]      8420
(20.0, 25.0]      5956
Name: TD022_bin, dtype: int64

3.3. Distributions of the numerical data, and the '% Y by X' (the mean column) for each numerical variable

This will help identify whether mean or median is a better imputation for NA's, help with manual binning, and aid feature selection moving forward.

The 'mean' column represents the '% Y by X'.

In [34]:
def plot_X_and_Y(var):
    
    z = df_bin.groupby(var + '_bin')['loan_default'].agg(['count','mean']).reset_index() 
    z['count_pcnt'] = z['count']/z['count'].sum()
    x = z[var + '_bin']
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5

    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    #plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind,x.tolist(), rotation=45)

    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    #plt.ylabel('Y by X')
    plt.xticks(ind,x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()
    
#for i in features:
#    plot_X_and_Y(i)  
    

3.4. Distributions of the categorical data, and the '% Y by X' (the mean column) for each categorical variable

This will help identify whether mean or median is a better imputation for NA's, help with manual binning, and aid feature selection moving forward.

The 'mean' column represents the '% Y by X'.

In [35]:
features_2 = ['AP002', 'AP003', 'AP004', 'AP006', 'AP007', 'AP008', 'AP009',
       'CR009','CR015', 'MB007', 'Loan_app_day_name', 'Loan_app_month',
       'Loan_app_day'
           ]
In [36]:
def plot_X_and_Y_cat(var):
    
    z = df_bin.groupby(var)['loan_default'].agg(['count','mean']).reset_index() 
    z['count_pcnt'] = z['count']/z['count'].sum()
    x = z[var]
    y_mean = z['mean']
    count_pcnt = z['count_pcnt']
    ind = np.arange(0, len(x))
    width = .5

    fig = plt.figure(figsize=(16,4))
    plt.subplot(121)
    plt.bar(ind, count_pcnt, width, color='r')
    plt.ylabel('X')
    plt.title(var + ' Distribution')
    plt.xticks(ind,x.tolist(), rotation=45)

    plt.subplot(122)
    plt.bar(ind, y_mean, width, color='b')
    plt.ylabel('Y by X')
    plt.xticks(ind,x.tolist(), rotation=45)
    plt.tight_layout()
    plt.title('Response mean by ' + var)
    plt.show()


for i in features_2:
    plot_X_and_Y_cat(i)  

Observations:

From the above graphs, the following variables do not seem important, as they show no pattern, trend, or curve in the '% Y by X' graph:

  1. Loan_app_day_name

Other EDA

3.5. Showing the Distribution of X

In [37]:
df_count = df['AP006'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP006 - OS Type','Count']
print(df_count.head())

fig = px.bar(df_count, x = 'AP006 - OS Type', y = 'Count', color = 'AP006 - OS Type',
             width=600, height=400,
            title = "Distribution of OS type")
fig.show()
  AP006 - OS Type  Count
0              h5  44246
1             ios  17159
2         android  17140
3             api   1455
In [38]:
df_count = df['AP002'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP002 - Gender','Count']
print(df_count.head())

fig = px.bar(df_count, x = 'AP002 - Gender', y = 'Count', color = 'AP002 - Gender',
             width=600, height=400,
            title = "Distribution of Gender")
fig.show()
   AP002 - Gender  Count
0               1  54255
1               2  25745
In [39]:
df_count = df['AP003'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['AP003 - Education','Count']
print(df_count.head())

fig = px.bar(df_count, x = 'AP003 - Education', y = 'Count', color = 'AP003 - Education',
             width=600, height=400,
            title = "Distribution of Education")
fig.show()
   AP003 - Education  Count
0                  1  45079
1                  3  23829
2                  4  10846
3                  5    232
4                  6     14
In [40]:
fig = px.box(df, x="TD001",width=1000, height=500,
            title = "Distribution of TD001 - TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()
In [41]:
fig = px.box(df, x="MB005",width=1000, height=500,
            title = "Distribution of MB005")
fig.show()

3.6. Showing the Distribution of Y by another Categorical Variable X

In [42]:
fig = px.box(df, x="AP007", y="TD001",width=900, height=400,
             color = "AP002",
            title = "The Distribution of Level Application City by TD_CNT_QUERY_LAST_7Day_P2P")
fig.show()

3.7. Showing interaction of two or three variables

In [43]:
fig = sns.pairplot(df[['AP002', 'AP003', 'AP004']], 
             hue= 'AP004')
fig
Out[43]:
<seaborn.axisgrid.PairGrid at 0x7fbc5c789510>

4. Model

Use the non-binned data, filling NA's with the median-initialized Iterative Imputer values computed earlier.

In [44]:
# Overwrite the NA-value columns with the previously computed imputed values
df[features_nan] = imputed_data_median
In [45]:
df.head(2)
Out[45]:
loan_default AP001 AP002 AP003 AP004 AP006 AP007 AP008 AP009 TD001 ... CD169 CD170 CD172 CD173 MB005 MB007 Loan_app_day_name Loan_app_month Loan_app_time Loan_app_day
0 1 31 2 1 12 ios 3 3 1 1 ... 1449 1449 2249 2249 7 IPHONE7 Thursday July 10:21:00 6
1 0 27 1 1 12 h5 5 4 0 2 ... 1086 939 1651 1425 5 WEB Thursday April 12:51:00 6

2 rows × 76 columns

In [46]:
df.isnull().sum().sum()
Out[46]:
0

4.1. H2O Random Forest Model

In [47]:
import h2o
h2o.init()

from h2o.estimators.random_forest import H2ORandomForestEstimator
Checking whether there is an H2O instance running at http://localhost:54321 . connected.
H2O_cluster_uptime: 1 hour 23 mins
H2O_cluster_timezone: America/New_York
H2O_data_parsing_timezone: UTC
H2O_cluster_version: 3.30.1.3
H2O_cluster_version_age: 1 month and 1 day
H2O_cluster_name: H2O_from_python_harshdhanuka_zvf12h
H2O_cluster_total_nodes: 1
H2O_cluster_free_memory: 1.095 Gb
H2O_cluster_total_cores: 8
H2O_cluster_allowed_cores: 8
H2O_cluster_status: locked, healthy
H2O_connection_url: http://localhost:54321
H2O_connection_proxy: {"http": null, "https": null}
H2O_internal_security: False
H2O_API_Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python_version: 3.7.4 final

Splitting the data to train-test, and convert to h2o format

In [48]:
train,test = train_test_split(df,test_size=0.25, random_state = 1234)

# Convert to H2O dataframes for computation
df_hex = h2o.H2OFrame(df)
train_hex = h2o.H2OFrame(train)
test_hex = h2o.H2OFrame(test)

# This test_hex will be used throughout the models
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

Define the target and predictor variables

In [49]:
# Selecting all predictor variables

predictors = ['AP001', 'AP002', 'AP003', 'AP004', 'AP006', 'AP007',
       'AP008', 'AP009', 'TD001', 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
       'TD013', 'TD014', 'TD015', 'TD022', 'TD023', 'TD024', 'CR004', 'CR005',
       'CR009', 'CR015', 'CR017', 'CR018', 'CR019', 'PA022', 'PA023', 'PA028',
       'PA029', 'PA030', 'PA031', 'CD008', 'CD018', 'CD071', 'CD072', 'CD088',
       'CD100', 'CD101', 'CD106', 'CD107', 'CD108', 'CD113', 'CD114', 'CD115',
       'CD117', 'CD118', 'CD120', 'CD121', 'CD123', 'CD130', 'CD131', 'CD132',
       'CD133', 'CD135', 'CD136', 'CD137', 'CD152', 'CD153', 'CD160', 'CD162',
       'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005',
       'MB007', 'Loan_app_day_name', 'Loan_app_month', 'Loan_app_time',
       'Loan_app_day']

target = 'loan_default'
In [50]:
len(predictors)
Out[50]:
75
In [51]:
len(df.columns.to_list())
Out[51]:
76

Train the H2O Model

In [52]:
RF_h2o_modl1 = H2ORandomForestEstimator(
        model_id = 'RF_h2o_modl1',
        ntrees = 400,
        nfolds=10,
        min_rows=100,
        seed=1234)

RF_h2o_modl1.train(predictors,target,training_frame = train_hex)
drf Model Build progress: |███████████████████████████████████████████████| 100%

Check Variable Importance

In [53]:
var_imps = pd.DataFrame(RF_h2o_modl1.varimp(), columns = ['Variable', 'Relative_Importance', 
                                               'Scaled_Importance', 'Percentage'])

Re-build Model with 55 most Important features

The model has 75 features in total. After running the initial model with all features, I tried many different combinations of the most important features and compared the resulting LIFT scores. The best lift came from the top 55 features, so I select those by variable importance and re-run the model.

In [54]:
predictors = var_imps['Variable'].head(55).to_list()

target = 'loan_default'
In [55]:
RF_h2o_modl1 = H2ORandomForestEstimator(
        model_id = 'RF_h2o_modl1',
        ntrees = 400,
        nfolds=10,
        min_rows=100,
        seed=1234)

RF_h2o_modl1.train(predictors,target,training_frame = train_hex)
drf Model Build progress: |███████████████████████████████████████████████| 100%
In [56]:
var_imps = pd.DataFrame(RF_h2o_modl1.varimp(), columns = ['Variable', 'Relative_Importance', 
                                               'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
The 10 most important features are: 

Out[56]:
Variable Relative_Importance Scaled_Importance Percentage
0 TD013 31743.115234 1.000000 0.131650
1 AP003 26869.242188 0.846459 0.111437
2 AP004 23823.884766 0.750521 0.098806
3 TD009 20168.527344 0.635367 0.083646
4 MB007 17379.031250 0.547490 0.072077
5 TD005 13576.405273 0.427696 0.056306
6 CR015 8044.451660 0.253424 0.033363
7 TD014 7797.213379 0.245635 0.032338
8 MB005 6884.897461 0.216894 0.028554
9 Loan_app_day_name 6594.753418 0.207754 0.027351
In [57]:
def VarImp(model_name,m_name):
    
    # plot the variable importance
    plt.rcdefaults()
    variables = model_name._model_json['output']['variable_importances']['variable']
    y_pos = np.arange(len(variables))
    fig, ax = plt.subplots(figsize = (8,16))
    scaled_importance = model_name._model_json['output']['variable_importances']['scaled_importance']
    ax.barh(y_pos,scaled_importance,align='center',color='green')
    ax.set_yticks(y_pos)
    ax.set_yticklabels(variables)
    ax.invert_yaxis()
    ax.set_xlabel('Scaled Importance')
    ax.set_title(m_name + ' Variable Importance in Decreasing Order')
    plt.show()
In [58]:
VarImp(RF_h2o_modl1,'Random Forest')

Predict on Test Set

Use the original test_hex set for prediction

In [59]:
def actual_predict(model,test_hex,target):
    y_pred = model.predict(test_hex).as_data_frame()
    y_actual = test_hex[target].as_data_frame()
    df_actual_predict = pd.concat([y_actual,y_pred],axis=1)
    df_actual_predict.columns = ['actual','pred']
    return(df_actual_predict)
In [60]:
RF_actual_predict = actual_predict(RF_h2o_modl1,test_hex,target)
RF_actual_predict.head()
drf prediction progress: |████████████████████████████████████████████████| 100%
Out[60]:
actual pred
0 0 0.181036
1 0 0.069164
2 0 0.365635
3 0 0.197441
4 0 0.099418

ROC and Area Under Curve

A receiver operating characteristic curve, or ROC curve, is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was developed for operators of military radar receivers, which is why it is so named.
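Equivalently, AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch on toy data (not this assignment's predictions) checks a rank-based computation against sklearn's roc_auc_score:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_by_rank(y_true, y_score):
    """AUC from the Mann-Whitney U statistic: the fraction of
    (positive, negative) pairs where the positive is scored higher.
    Assumes no tied scores, for simplicity."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    order = np.argsort(y_score)
    ranks = np.empty(len(y_score))
    ranks[order] = np.arange(1, len(y_score) + 1)   # rank 1 = lowest score
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

y_true  = [0, 0, 1, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]
print(auc_by_rank(y_true, y_score))     # 2/3; both computations agree
print(roc_auc_score(y_true, y_score))
```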

In [61]:
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
Out[61]:
0.6845790610258718

Gain Table

In [62]:
def gains_table(df_actual_predict):
    df_actual_predict = df_actual_predict.sort_values(by='pred',ascending=False)
    df_actual_predict['row_id'] = range(len(df_actual_predict))
    
    df_actual_predict['decile'] = (df_actual_predict['row_id'] / (len(df_actual_predict)/10)).astype(int)
    # Defensive: fold any decile-10 edge case into decile 9 (only change the decile column)
    df_actual_predict.loc[df_actual_predict['decile'] == 10, 'decile'] = 9
    
    # Create gains table
    gains = df_actual_predict.groupby('decile')['actual'].agg(['count','sum'])
    gains.columns = ['count','actual']

    gains['non_actual'] = gains['count'] - gains['actual']
    gains['cum_count'] = gains['count'].cumsum()
    gains['cum_actual'] = gains['actual'].cumsum()
    gains['cum_non_actual'] = gains['non_actual'].cumsum()
    gains['percent_cum_actual'] = (gains['cum_actual'] / np.max(gains['cum_actual'])).round(2)
    gains['percent_cum_non_actual'] = (gains['cum_non_actual'] / np.max(gains['cum_non_actual'])).round(2)
    gains['if_random'] = np.max(gains['cum_actual']) /10
    gains['if_random'] = gains['if_random'].cumsum()
    gains['lift'] = (gains['cum_actual'] / gains['if_random']).round(2)
    gains['K_S'] = np.abs( gains['percent_cum_actual'] - gains['percent_cum_non_actual']  ) * 100 
    gains['gain'] = (gains['cum_actual'] / gains['cum_count']*100).round(2)
    return(gains)
In [63]:
RF_gains = gains_table(RF_actual_predict)
RF_gains
Out[63]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 2000 826 1174 2000 826 1174 0.21 0.07 391.8 2.11 14.0 41.30
1 2000 628 1372 4000 1454 2546 0.37 0.16 783.6 1.86 21.0 36.35
2 2000 493 1507 6000 1947 4053 0.50 0.25 1175.4 1.66 25.0 32.45
3 2000 435 1565 8000 2382 5618 0.61 0.35 1567.2 1.52 26.0 29.78
4 2000 397 1603 10000 2779 7221 0.71 0.45 1959.0 1.42 26.0 27.79
5 2000 334 1666 12000 3113 8887 0.79 0.55 2350.8 1.32 24.0 25.94
6 2000 303 1697 14000 3416 10584 0.87 0.66 2742.6 1.25 21.0 24.40
7 2000 228 1772 16000 3644 12356 0.93 0.77 3134.4 1.16 16.0 22.78
8 2000 176 1824 18000 3820 14180 0.97 0.88 3526.2 1.08 9.0 21.22
9 2000 98 1902 20000 3918 16082 1.00 1.00 3918.0 1.00 0.0 19.59
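To make the lift arithmetic concrete, here is a self-contained toy example (synthetic scores, not the assignment's data) that reproduces the decile and lift logic of gains_table above. With 20 positives all scored in the top two deciles, the top-decile lift is the maximum possible, 100/20 = 5:

```python
import numpy as np
import pandas as pd

# 100 synthetic rows: scores descending, the 20 positives get the top scores
df_toy = pd.DataFrame({
    'pred':   np.linspace(1.0, 0.01, 100),
    'actual': [1]*20 + [0]*80,
})

df_toy = df_toy.sort_values('pred', ascending=False).reset_index(drop=True)
df_toy['decile'] = np.arange(len(df_toy)) // (len(df_toy) // 10)   # 10 equal buckets

g = df_toy.groupby('decile')['actual'].agg(['count', 'sum'])
g.columns = ['count', 'actual']
g['cum_actual'] = g['actual'].cumsum()
# Expected cumulative positives if predictions were random
g['if_random'] = (g['actual'].sum() / 10) * np.arange(1, len(g) + 1)
g['lift'] = g['cum_actual'] / g['if_random']
print(g[['cum_actual', 'if_random', 'lift']])   # lift starts at 5.0, ends at 1.0
```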

ROC and Precision-Recall Curves

In [64]:
def ROC_PR(df_actual_predict):
    
    print('')
    print('   * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate')
    print('')
    print('   * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy')
    print('')
    print('   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)')
    print('')
    
    #  ROC
    roc_auc_value = roc_auc_score(df_actual_predict['actual'],df_actual_predict['pred'])
    fpr, tpr, _ = roc_curve(df_actual_predict['actual'],df_actual_predict['pred'])
    roc_auc = auc(fpr,tpr)
    
    lw=2
    plt.figure(figsize=(10,4))
    plt.subplot(1,2,1)
    plt.plot(fpr,tpr, color='darkorange',lw=lw,label='AUC = %0.4f' % roc_auc_value)
    plt.plot([0,1],[0,1], color='navy',lw=lw,linestyle='--')
    plt.xlim([0,1])
    plt.ylim([0,1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve: AUC={0:0.4f}'.format(roc_auc_value))
    plt.legend(loc='lower right')
    
    # Precision-Recall
    plt.subplot(1,2,2)
    average_precision = average_precision_score(df_actual_predict['actual'],df_actual_predict['pred'])
    precision, recall, _ = precision_recall_curve(df_actual_predict['actual'],df_actual_predict['pred'])
    plt.step(recall, precision, color='b', alpha=0.2, where='post')
    plt.fill_between(recall,precision,step='post',alpha=0.2,color='b')
    plt.xlabel('Recall')
    plt.ylabel('Precision')
    plt.xlim([0.0,1.0])
    plt.ylim([0.0,1.05])
    plt.title('Precision-Recall curve: PR={0:0.4f}'.format(average_precision))
In [65]:
ROC_PR(RF_actual_predict)
   * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate

   * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

4.2. H2O Random Forest Model, using Balance Classes

Use H2O's "balance_classes"

  • The balance_classes option can be used to balance the class distribution. When enabled, H2O will either undersample the majority classes or oversample the minority classes.
  • Note that the resulting model will also correct the final probabilities ("undo the sampling") using a monotonic transform, so the predicted probabilities of a model trained with balance_classes will differ from those of a model trained without it. However, because AUC only cares about the ordering of predictions, it is not affected.

If this option is enabled, then we can also specify a value for the class_sampling_factors and max_after_balance_size options to control the sampling:

  • class_sampling_factors takes a list of numbers which would be the sampling rate for each class. A value of 1 would not change the sample rate for a class, but setting it to 0.5 would reduce its sampling by half, and 2 would double its sample rate.
  • max_after_balance_size caps the relative size to which the training data can grow when over-sampling. By default it is 5, meaning the rebalanced training data can be at most 5x the size of the original. If you have many rows and prefer to under-sample the majority class instead, set max_after_balance_size to a value less than 1.

  • See the H2O documentation for more details.
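The "undo the sampling" correction mentioned above can be sketched with the standard prior-correction formula (my own illustration, not H2O's internal code): if the negative class was down-sampled at rate beta, the odds produced by the balanced model are too high by a factor of 1/beta, so a balanced-model score p_s maps back to p = beta*p_s / (beta*p_s + 1 - p_s). The transform is monotonic, which is why rankings, and hence AUC, are unchanged.

```python
def undo_undersampling(p_s, beta):
    """Map a probability from a model trained with the negative class
    down-sampled at rate beta back to the original class distribution."""
    return beta * p_s / (beta * p_s + 1.0 - p_s)

# Keeping 25% of the negatives: a balanced-model score of 0.5
# corresponds to 0.2 under the original class distribution
print(undo_undersampling(0.5, 0.25))   # 0.2

# Monotonic: higher balanced scores stay higher after correction
print(undo_undersampling(0.7, 0.25) > undo_undersampling(0.5, 0.25))   # True
```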

I will follow the same procedure as above, and just add the new parameter.

In [66]:
RF_h2o_modl2 = H2ORandomForestEstimator(
        model_id = 'RF_h2o_modl2',
        ntrees = 400,
        nfolds=10,
        min_rows=100,
        balance_classes = True,
        seed=1234)

RF_h2o_modl2.train(predictors,target,training_frame = train_hex)
drf Model Build progress: |███████████████████████████████████████████████| 100%
In [67]:
RF_actual_predict = actual_predict(RF_h2o_modl2,test_hex,target)
RF_actual_predict.head()
drf prediction progress: |████████████████████████████████████████████████| 100%
Out[67]:
actual pred
0 0 0.181036
1 0 0.069164
2 0 0.365635
3 0 0.197441
4 0 0.099418
In [68]:
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
Out[68]:
0.6845790610258718
In [69]:
RF_gains = gains_table(RF_actual_predict)
RF_gains
Out[69]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 2000 826 1174 2000 826 1174 0.21 0.07 391.8 2.11 14.0 41.30
1 2000 628 1372 4000 1454 2546 0.37 0.16 783.6 1.86 21.0 36.35
2 2000 493 1507 6000 1947 4053 0.50 0.25 1175.4 1.66 25.0 32.45
3 2000 435 1565 8000 2382 5618 0.61 0.35 1567.2 1.52 26.0 29.78
4 2000 397 1603 10000 2779 7221 0.71 0.45 1959.0 1.42 26.0 27.79
5 2000 334 1666 12000 3113 8887 0.79 0.55 2350.8 1.32 24.0 25.94
6 2000 303 1697 14000 3416 10584 0.87 0.66 2742.6 1.25 21.0 24.40
7 2000 228 1772 16000 3644 12356 0.93 0.77 3134.4 1.16 16.0 22.78
8 2000 176 1824 18000 3820 14180 0.97 0.88 3526.2 1.08 9.0 21.22
9 2000 98 1902 20000 3918 16082 1.00 1.00 3918.0 1.00 0.0 19.59
In [70]:
ROC_PR(RF_actual_predict)
   * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate

   * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Observations:

I see that both random forest models, the first without balance classes and the second with balance classes, produce exactly the same results.

This is likely because the minority class represents 19.3% (about 1/5) of the data and the majority class 80.6% (about 4/5), roughly 4 times the minority class. In other words, the class distribution here is not badly skewed, and about 20% of the data in one class is enough to build a stable model, so balancing changes little. If the majority class were more than 5-6 times the size of the minority class, the balance_classes parameter would make a much larger difference.

Also, for the train-test split, I found that a 75:25 split works best for all my models. I tried other combinations, but a 75% train set gives the best Lift score of 2.11 in both models, so I will continue with the same split ratio.

The optimum hyper-parameters for both models built above are:

  • ntrees = 400
  • nfolds = 10
  • min_rows = 100
  • balance_classes = True or False (no effect here)

Please see below for the meaningful business insights:

Business Insight:

The H2O package is a very effective and efficient way to build a machine learning model for predicting loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding the categorical variables.

Gains table and Lift: For the final model, built after tuning the models over a range of values for each parameter and then tuning all the hyper-parameters together for the best result, the highest Lift score I obtained is 2.11, which is good by industry standards. A Lift score above 2 is generally considered acceptable.

ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. However, AUC weighs both classes equally, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC on highly skewed datasets: they plot precision vs. recall (the true positive rate), and because precision is directly influenced by class imbalance, PR curves better highlight differences between models on highly imbalanced data.

However, neither metric alone represents the results perfectly, as one does not reflect the minority class well and the other is sensitive to imbalanced data; hence, both are reported here. We get an AUC of 0.68 and a PR (average precision) of 0.34, which is acceptable, but I will try to improve both in my following models.

The major outcome of this exercise is that random forest is a good model for predicting loan default on the given data. However, we should not overlook other strong models such as GBM, XGBoost, or AutoML; these might provide better results as well.

4.3. Plotting the data points for visualization, under-sampling

Check the sample sizes of the 0's and 1's, and visualize the distribution of loan defaults (1's and 0's) in the dataset

In [71]:
# Select a few numeric variables for the PCA

num_variables = ['loan_default', 'AP001', 'TD001',
                 'TD002', 'TD005', 'TD006', 'TD009', 'TD010',
       'CD164', 'CD166', 'CD167', 'CD169', 'CD170', 'CD172', 'CD173', 'MB005']
In [72]:
# Create a new temporary df with only the numeric columns for PCA
df_new = df[num_variables]
In [73]:
target = 'loan_default'
In [74]:
y = df_new[target]
X = df_new.drop(target,axis=1)
y.dtypes
Out[74]:
dtype('int64')
In [75]:
y1_cnt = df_new[target].sum()
print(y1_cnt)
15488
In [76]:
# Take twice as many 0's as 1's; the ratio is currently about 4:1

N = 2
y0_cnt = y1_cnt * N
y0_cnt
Out[76]:
30976
In [77]:
y.value_counts()
Out[77]:
0    64512
1    15488
Name: loan_default, dtype: int64
In [78]:
# Adjust the temporary dataset as per the 2:1 ratio we defined above

X_rs, y_rs = make_imbalance(X, y, 
                            sampling_strategy={1:y1_cnt , 0:  y0_cnt},
                            random_state=0)
#X_rs = pd.DataFrame(X_rs)
#y_rs = pd.DataFrame(y_rs)    
In [79]:
y_rs.value_counts()
Out[79]:
0    30976
1    15488
Name: loan_default, dtype: int64
In [80]:
smpl = pd.concat([X_rs,y_rs], axis = 1)
In [81]:
# Verify
smpl['loan_default'].value_counts()
Out[81]:
0    30976
1    15488
Name: loan_default, dtype: int64

For ease of visualization, I use principal component analysis to reduce the dimensions and pick the first two principal components. A scatterplot for the dataset is shown below.
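Standalone, the two-component reduction used below amounts to the following (a shape-only sketch on random data, not this dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(200, 15))        # 200 rows, 15 numeric features

pca = PCA(n_components=2).fit(X_toy)      # fit the top-2 principal components
X_2d = pca.transform(X_toy)               # project every row onto them
print(X_2d.shape)                         # (200, 2): ready for a 2-D scatterplot
```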

In [82]:
def plot_this(X_rs, y_rs, method):
    # Use principal components to condense the features to 2 dimensions
    pca = PCA(n_components=2).fit(X_rs)
    pca_2d = pca.transform(X_rs)
    # Assign a color and marker to each class
    for i in range(0, pca_2d.shape[0]):
        if y_rs[i] == 0:
            c1 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='r', marker='o')
        elif y_rs[i] == 1:
            c2 = pl.scatter(pca_2d[i,0], pca_2d[i,1], c='g', marker='*')
    pl.legend([c1, c2], ['Class 0 (no default)', 'Class 1 (default)'])
    pl.title(method)
    pl.axis([-4, 5, -4, 4])  # x axis (-4,5), y axis (-4,4)
    pl.show()

# plot_this(X,y,'Original')

# This takes a long time to run, so it is commented out
In [83]:
print('Random UnderSampler {}'.format(Counter(y_rs)))


#plot_this(X_rs,y_rs,'Random undersampling')

# This takes a long time to run, so it is commented out
Random UnderSampler Counter({0: 30976, 1: 15488})

4.4. Random Forest Model with manual Under-Sampling

I will manually perform random under-sampling using the RandomUnderSampler function from the imblearn package, and then build the H2O Random Forest model.

In [84]:
df2 = df.copy()
In [85]:
target = 'loan_default'
In [86]:
y = df2[target]
X = df2.drop(target,axis=1)
y.dtypes
Out[86]:
dtype('int64')
In [87]:
# Check Statistics
y1_cnt = df2[target].sum()
print(y1_cnt)
df['loan_default'].value_counts()

print("The percent of data with value = 1 or loan default = 1 in the target column is")
(df2['loan_default'].value_counts(dropna=False)[1]/df2.shape[0])*100
15488
The percent of data with value = 1 or loan default = 1 in the target column is
Out[87]:
19.36
In [88]:
# I plan to reduce the majority class to only twice the minority class; it is currently about 4 times
df['loan_default'].value_counts(dropna=False)[1]*2
Out[88]:
30976
In [89]:
y.value_counts()
Out[89]:
0    64512
1    15488
Name: loan_default, dtype: int64
In [90]:
from imblearn.under_sampling import RandomUnderSampler

# Here, I decrease the majority class to 2 times the size of the minority class

sampler = RandomUnderSampler(sampling_strategy={1: 15488, 0: 30976}, random_state=0)
In [91]:
X_rs, y_rs = sampler.fit_resample(X, y)  # fit_sample was renamed fit_resample in newer imblearn
In [92]:
print('RandomUnderSampler {}'.format(Counter(y_rs)))
RandomUnderSampler Counter({0: 30976, 1: 15488})
In [93]:
# Verify
y_rs.value_counts()
Out[93]:
0    30976
1    15488
Name: loan_default, dtype: int64
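For intuition, RandomUnderSampler with this sampling_strategy is equivalent to keeping all positives and drawing negatives without replacement; a minimal pandas sketch on toy data (my own illustration, not the imblearn internals):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'x': range(10),
    'loan_default': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],   # 2 positives, 8 negatives
})

n_pos = int(df_toy['loan_default'].sum())
pos = df_toy[df_toy['loan_default'] == 1]
# Sample 2x as many negatives as positives, without replacement
neg = df_toy[df_toy['loan_default'] == 0].sample(n=2 * n_pos, random_state=0)

balanced = pd.concat([pos, neg])
print(balanced['loan_default'].value_counts())   # 0: 4, 1: 2
```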
In [94]:
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
In [95]:
smpl = pd.concat([X_rs,y_rs], axis = 1)
In [96]:
smpl_hex = h2o.H2OFrame(smpl)
Parse progress: |█████████████████████████████████████████████████████████| 100%

Note:

An important note: I will use the entire resampled dataset, which is a balanced but no longer representative sample, for training.

For prediction, however, I will use the original test_hex data defined above, since I do not want to predict on a misrepresented test set. Hence, for this manual under-sampling model the train-test split is skipped, as test_hex already exists.

In [97]:
rf_u_sample = H2ORandomForestEstimator(
        model_id = 'rf_u_sample',
        ntrees = 400,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_u_sample.train(predictors,target,training_frame=smpl_hex)
drf Model Build progress: |███████████████████████████████████████████████| 100%
In [98]:
RF_actual_predict = actual_predict(rf_u_sample,test_hex,target)
RF_actual_predict.head()

# The 'test_hex' which will be used here will be original test data, not a test split from the new balanced dataset.
drf prediction progress: |████████████████████████████████████████████████| 100%
Out[98]:
actual pred
0 0 0.345559
1 0 0.112509
2 0 0.493005
3 0 0.339093
4 0 0.165406
In [99]:
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
Out[99]:
0.727278012208869
In [100]:
RF_gains = gains_table(RF_actual_predict)
RF_gains
Out[100]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 2000 930 1070 2000 930 1070 0.24 0.07 391.8 2.37 17.0 46.50
1 2000 679 1321 4000 1609 2391 0.41 0.15 783.6 2.05 26.0 40.22
2 2000 507 1493 6000 2116 3884 0.54 0.24 1175.4 1.80 30.0 35.27
3 2000 480 1520 8000 2596 5404 0.66 0.34 1567.2 1.66 32.0 32.45
4 2000 382 1618 10000 2978 7022 0.76 0.44 1959.0 1.52 32.0 29.78
5 2000 311 1689 12000 3289 8711 0.84 0.54 2350.8 1.40 30.0 27.41
6 2000 271 1729 14000 3560 10440 0.91 0.65 2742.6 1.30 26.0 25.43
7 2000 175 1825 16000 3735 12265 0.95 0.76 3134.4 1.19 19.0 23.34
8 2000 129 1871 18000 3864 14136 0.99 0.88 3526.2 1.10 11.0 21.47
9 2000 54 1946 20000 3918 16082 1.00 1.00 3918.0 1.00 0.0 19.59
In [101]:
ROC_PR(RF_actual_predict)
   * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate

   * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

4.5. Random Forest Model with manual Over-Sampling

I will manually perform random over-sampling using the RandomOverSampler function from the imblearn package, and then build the H2O Random Forest model.

In [102]:
df2 = df.copy()
In [103]:
target = 'loan_default'
In [104]:
y = df2[target]
X = df2.drop(target,axis=1)
y.dtypes
Out[104]:
dtype('int64')
In [105]:
# Check Statistics
y1_cnt = df2[target].sum()
print(y1_cnt)
print(df2['loan_default'].value_counts())

print("The percent of data with value = 1 or loan default = 1 in the target column is")
(df2['loan_default'].value_counts(dropna=False)[1]/df2.shape[0])*100
15488
0    64512
1    15488
Name: loan_default, dtype: int64
The percent of data with value = 1 or loan default = 1 in the target column is
Out[105]:
19.36
In [106]:
# I plan to grow the minority class to about 4 times its current size, to roughly match the majority class
df['loan_default'].value_counts(dropna=False)[1]*4
Out[106]:
61952
In [107]:
from imblearn.over_sampling import RandomOverSampler

# With over-sampling methods, the number of samples in a class
# should be greater than or equal to the original number of samples.

# Here, I increase the minority class to 64512 samples, making both classes equal in size

sampler = RandomOverSampler(sampling_strategy={1: 64512, 0: 64512}, random_state=0)
In [108]:
X_rs, y_rs = sampler.fit_resample(X, y)  # fit_sample was renamed fit_resample in newer imblearn
print('RandomOverSampler {}'.format(Counter(y_rs)))
# plot_this(X_rs,y_rs)
RandomOverSampler Counter({1: 64512, 0: 64512})
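By contrast with under-sampling, random over-sampling draws the minority class with replacement until it reaches the target size; a minimal pandas sketch of the same idea (illustration only, not the imblearn internals):

```python
import pandas as pd

df_toy = pd.DataFrame({
    'x': range(10),
    'loan_default': [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],   # 2 positives, 8 negatives
})

pos = df_toy[df_toy['loan_default'] == 1]
neg = df_toy[df_toy['loan_default'] == 0]

# Duplicate minority rows (replace=True) until both classes are equal in size
pos_up = pos.sample(n=len(neg), replace=True, random_state=0)

balanced = pd.concat([neg, pos_up])
print(balanced['loan_default'].value_counts())   # 0: 8, 1: 8
```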
In [109]:
X_rs = pd.DataFrame(X_rs)
y_rs = pd.DataFrame(y_rs)
In [110]:
smpl = pd.concat([X_rs,y_rs], axis = 1)
In [111]:
train,test = train_test_split(smpl,test_size=0.25, random_state = 1234)

# Convert to a  h2o dataframe for computation
smpl_hex = h2o.H2OFrame(smpl)
train_hex = h2o.H2OFrame(train)
Parse progress: |█████████████████████████████████████████████████████████| 100%
Parse progress: |█████████████████████████████████████████████████████████| 100%

Note:

An important note: as before, I will use the entire resampled dataset, which is a balanced but no longer representative sample, for training.

For prediction, however, I will use the original test_hex data defined above, since I do not want to predict on a misrepresented test set. Hence, for this manual over-sampling model the train-test split created above is not used for training; the model is trained on the full smpl_hex and evaluated on test_hex.

In [112]:
rf_o_sample = H2ORandomForestEstimator(
        model_id = 'rf_o_sample',
        ntrees = 400,
        nfolds=10,
        min_rows=100,
        seed=1234)
rf_o_sample.train(predictors,target,training_frame=smpl_hex)
drf Model Build progress: |███████████████████████████████████████████████| 100%
In [113]:
# NOTE: this re-displays the importances of the baseline RF_h2o_modl1;
# to inspect the over-sampled model, pass rf_o_sample.varimp() instead
var_imps = pd.DataFrame(RF_h2o_modl1.varimp(), columns = ['Variable', 'Relative_Importance', 
                                               'Scaled_Importance', 'Percentage'])
print()
print("The 10 most important features are: ")
print()
var_imps.head(10)
The 10 most important features are: 

Out[113]:
Variable Relative_Importance Scaled_Importance Percentage
0 TD013 31743.115234 1.000000 0.131650
1 AP003 26869.242188 0.846459 0.111437
2 AP004 23823.884766 0.750521 0.098806
3 TD009 20168.527344 0.635367 0.083646
4 MB007 17379.031250 0.547490 0.072077
5 TD005 13576.405273 0.427696 0.056306
6 CR015 8044.451660 0.253424 0.033363
7 TD014 7797.213379 0.245635 0.032338
8 MB005 6884.897461 0.216894 0.028554
9 Loan_app_day_name 6594.753418 0.207754 0.027351
In [114]:
VarImp(RF_h2o_modl1,'Random Forest')
In [115]:
RF_actual_predict = actual_predict(rf_o_sample,test_hex,target)
RF_actual_predict.head()

# The 'test_hex' which will be used here will be original test data, not a test split from the new balanced dataset.
drf prediction progress: |████████████████████████████████████████████████| 100%
Out[115]:
actual pred
0 0 0.485829
1 0 0.136799
2 0 0.648903
3 0 0.521886
4 0 0.234849
In [116]:
dd = RF_actual_predict
RF_roc_auc_value = roc_auc_score(dd['actual'],dd['pred'])
RF_roc_auc_value
Out[116]:
0.7926184709692585
In [117]:
RF_gains = gains_table(RF_actual_predict)
RF_gains
Out[117]:
count actual non_actual cum_count cum_actual cum_non_actual percent_cum_actual percent_cum_non_actual if_random lift K_S gain
decile
0 2000 1178 822 2000 1178 822 0.30 0.05 391.8 3.01 25.0 58.90
1 2000 748 1252 4000 1926 2074 0.49 0.13 783.6 2.46 36.0 48.15
2 2000 546 1454 6000 2472 3528 0.63 0.22 1175.4 2.10 41.0 41.20
3 2000 467 1533 8000 2939 5061 0.75 0.31 1567.2 1.88 44.0 36.74
4 2000 320 1680 10000 3259 6741 0.83 0.42 1959.0 1.66 41.0 32.59
5 2000 261 1739 12000 3520 8480 0.90 0.53 2350.8 1.50 37.0 29.33
6 2000 180 1820 14000 3700 10300 0.94 0.64 2742.6 1.35 30.0 26.43
7 2000 129 1871 16000 3829 12171 0.98 0.76 3134.4 1.22 22.0 23.93
8 2000 66 1934 18000 3895 14105 0.99 0.88 3526.2 1.10 11.0 21.64
9 2000 23 1977 20000 3918 16082 1.00 1.00 3918.0 1.00 0.0 19.59
In [118]:
print()
print("The final best Lift score is: ")
print()
print(RF_gains['lift'][0])
The final best Lift score is: 

3.01
In [119]:
ROC_PR(RF_actual_predict)
   * ROC curve: The ROC curve plots the true positive rate vs. the false positive rate

   * The area under the curve (AUC): A value between 0.5 (random) and 1.0 (perfect), measuring the prediction accuracy

   * Recall (R) = The number of true positives / (the number of true positives + the number of false negatives)

Observations:

Here, I ran two random forest models, the first with manual under-sampling and the second with manual over-sampling, and the output differs substantially between them.

In our dataset, the minority class represents 19.3% (about 1/5) of the data and the majority class 80.6% (about 4/5), roughly 4 times the minority class. Manual under- and over-sampling therefore changed model performance significantly.

In fact, with manual under-sampling the Lift score went up from 2.11 to 2.37 in my best model, and with manual over-sampling it went further up to 3.01.

Also, for the train-test split, I found that a 75:25 split works best for all my models. I tried other combinations, but a 75% train set gives the best Lift score in all the models.

The optimum hyper-parameters for the over-sampling model with the best Lift score of 3.01 are:

  • ntrees = 400
  • nfolds = 10
  • min_rows = 100
  • the minority class manually over-sampled until both classes are equal in size

Please see below for the meaningful business insights:

Business Insight:

The H2O package is a very effective and efficient way to build a machine learning model for predicting loan default. It is also very handy for displaying variable importance, handling correlations, and dummy-coding the categorical variables.

Gains table and Lift: For the final model, built after tuning the different models over a range of values for each parameter and then tuning all the hyper-parameters together for the best result, the highest Lift score I obtained is 3.01, which is very good by industry standards. A Lift score above 2 is generally considered acceptable.

ROC and AUC: The area under the ROC curve (AUC) assesses overall classification performance. However, AUC weighs both classes equally, so it does not reflect the minority class well. Precision-Recall (PR) curves are more informative than ROC on highly skewed datasets: they plot precision vs. recall (the true positive rate), and because precision is directly influenced by class imbalance, PR curves better highlight differences between models on highly imbalanced data.

However, neither metric alone represents the results perfectly, as one does not reflect the minority class well and the other is sensitive to imbalanced data; hence, both are reported here.

In my case, the AUC is 0.79 and the PR score is 0.50, which is acceptable.

The major outcome of this exercise is that random forest through the H2O package is a good approach for predicting loan default. However, we should not overlook other strong models such as GBM, XGBoost, or AutoML; these might provide better results as well.

In [ ]: